1.Introduction

Our client has been conducting research on capitalist economies at both the state and the national scale. During that research, a negative relationship between union density and vocational education was found, which conflicts with the Varieties of Capitalism theory. This project explores whether racial diversity can explain that negative relationship. Two types of analysis are involved: regression modeling to capture the relationship, and mediation analysis to test whether racial diversity mediates it.

2.Research Question

According to the Varieties of Capitalism (VoC) theory, the United States is classified as a liberal market economy (LME), and one characteristic of this market type is that union density is positively related to vocational education. However, the linear model fitted and presented by the client suggests otherwise. The model contains 9 variables from the dataset: Highschool, the outcome, stands for the number of people engaged in vocational education, and Urbanity measures the degree to which a state is urban. The original model is shown below:

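\[\text{Highschool}=4.754\times10^{5}-1.692\times10^{4}\,\text{Uniondensity}+6.629\times10^{-2}\,\text{GDP}-1.045\times10^{3}\,\text{Urbanity}+\] \[2.804\times10^{2}\,\text{Population}-51.76\,\text{Export}+1.909\times10^{4}\,\text{Unemployrate}+1.241\times10^{3}\,\text{Costofliving}-9.598\times10^{5}\,\text{Gini}\]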

The coefficient of union density is negative, which indicates a negative relationship between union density and Highschool. According to VoC theory, an LME should show a positive relationship between union density and vocational education, so the result conflicts with the properties of an LME. A potential explanation is that racial diversity in each state leads to this negative relationship between vocational education and union density.

3.Part I

3.1 Data Interpretation

  • Data description
    The client initially sent us data with 31 variables for 51 observations: the 50 US states and Washington D.C. Not all the variables are used in the regression model. Based on the content of the project, the dependent variable is the number of high school students in vocational training programs, which represents vocational education, and the independent variables include the predictor and other confounding variables. The predictor is union density, and the confounding variables are the proportions of White, Black and Hispanic people, GDP, urbanity, population, export, cost of living, unemployment rate and Gini index. Since the client’s question involves racial diversity, we computed a measure of racial diversity using the “Generalized Variance” method found in “How to Measure Diversity When You Must” (David V. Budescu, 2012), using the proportions of White, Black, and Hispanic people in each state's population. This is because the proportions of the races alone do not represent racial diversity.

  • Data governance
    There are several issues with the dataset that may have implications for the analysis. First, the data are sourced from different years, from 2016 to 2018. Second, vocational education is misrepresented. In the original dataset, vocational education is represented by the number of high school students, which is the demand side of vocational education. However, VoC theory classifies countries based on the actions of government and business. Therefore, it might be more helpful to use the supply side of vocational education rather than the demand side. Third, racial diversity is misrepresented. Racial diversity is initially represented by the number of White / Black / Hispanic people. However, according to the literature mentioned above, the proportion of a single race does not give a measure of racial diversity. The normalized racial diversity, which is calculated from the proportions of the different races, is more reasonable. Therefore, a new variable named “normalized racial diversity” is added to the dataset.

To obtain the racial diversity, we first calculate the proportion of the same race, which is equal to the sum of the squared proportion of Black, the squared proportion of White, and the squared proportion of Hispanic people. Subtracting this quantity from one gives the proportion of different races. However, there are many different ways to collect data on racial groups, in particular which races are recorded. To alleviate the heterogeneity this can introduce when modeling, racial diversity is normalized before it is added to the model. Multiplying the racial diversity by C/(C-1) gives the normalized racial diversity, where C is the number of race categories. In this case, C is 3.

\[\text{Racial Diversity}=1-\text{White}^2-\text{Black}^2-\text{Hispanic}^2\] \[\text{Normalized Racial Diversity} = \text{Racial Diversity}\times\frac{3}{2}\]
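The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function name and the example proportions are made up, and the report's own computation may differ in detail.

```python
def normalized_racial_diversity(proportions):
    """Return (raw, normalized) diversity for a list of group proportions.

    raw        = 1 - sum of squared proportions
    normalized = raw * C / (C - 1), where C is the number of categories
    """
    c = len(proportions)  # number of race categories (C in the text)
    if c < 2:
        raise ValueError("need at least two categories")
    raw = 1.0 - sum(p ** 2 for p in proportions)
    return raw, raw * c / (c - 1)


# Hypothetical state: 60% White, 25% Black, 15% Hispanic (illustrative only).
raw, norm = normalized_racial_diversity([0.60, 0.25, 0.15])
print(round(raw, 4), round(norm, 4))
```

With three equally sized groups the raw measure is 2/3 and the normalized measure is 1, which is why the C/(C-1) factor makes values comparable across different numbers of categories.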

3.2 Exploratory Data Analysis

This plot visualizes the potential relationship, with union density on the x-axis and vocational education on the y-axis. From this plot, there seems to be a negative relationship. However, because there are so many leverage points, we cannot conclude whether this negative relationship actually exists or is a result of the leverage points.

Moreover, a bubble plot is presented. As in the previous plot, union density is the independent variable and Highschool is the dependent variable. This time, however, each data point is drawn as a circle. Each circle represents one state and its radius represents the racial diversity level: the higher the racial diversity, the larger the circle. As before, we cannot confirm from this plot that the two variables have a negative relationship. Furthermore, there does not seem to be a clear pattern in racial diversity across the states.

3.3 Methodology and Result

3.3.1 Model I

  • Replicate Original Model
    Since the initial EDA does not suggest a distinctive negative pattern between vocational education and unionization, we replicated the client's model to confirm its results. We were able to replicate her results: the coefficient matches the one the client provided. The negative coefficient of -16915 indicates that, for every one-unit increase in union density, the number of high school students in vocational training is expected to decrease by 16915, holding the other variables constant.

  • Partial Correlation
    Because other factors affect the relationship between the outcome variable, vocational education, and union density, partial correlation analysis is used to investigate that relationship more closely. Partial correlation measures the strength and direction of a linear relationship between two continuous variables while controlling for the effect of the other predictors in the regression model. The correlation coefficient varies between +1 and −1. A value of +1 or −1 indicates a perfect degree of association between the two variables. As the coefficient approaches 0, the relationship between the two variables becomes weaker.
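As a minimal sketch of the idea, the snippet below computes a partial correlation with a single control variable by correlating the residuals of two simple regressions. The data values and function names are hypothetical, and the report's analysis controls for several covariates at once rather than one.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residuals(y, z):
    """Residuals from the simple least-squares regression of y on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    intercept = my - slope * mz
    return [b - (intercept + slope * a) for a, b in zip(z, y)]


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    return pearson(residuals(x, z), residuals(y, z))


# Made-up values for illustration only.
union = [5.0, 8.0, 12.0, 15.0, 20.0, 23.0]
voc = [9.1, 8.7, 8.9, 7.6, 7.9, 6.8]
pop = [1.0, 2.5, 2.0, 4.0, 3.5, 6.0]
r_partial = partial_corr(union, voc, pop)
print(round(r_partial, 3))
```

Correlating residuals is the same construction that underlies a partial correlation plot of one residual series against another.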

When conducting partial correlation analysis, a method for computing the partial correlation coefficients has to be specified. Two methods are used in this analysis. One is the Pearson method, which evaluates the linear relationship between two continuous variables. A relationship is linear when a change in one variable is associated with a proportional change in the other variable. The following formula is used to calculate the Pearson correlation:

\[r_{xy} = \frac{n\sum{x_iy_i}-\sum{x_i}\sum{y_i}}{\sqrt{n\sum{x_i^2}-(\sum{x_i})^2}\sqrt{n\sum{y_i^2}-(\sum{y_i})^2}}\]

\(r_{xy}\) = Pearson r correlation coefficient between x and y
\(n\) = number of observations
\(x_i\) = value of x (for ith observation)
\(y_i\) = value of y (for ith observation)
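The formula can be transcribed directly into code. The sketch below follows the raw-sum form given above; the function name and the sample values are illustrative assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson r using the raw-sum formula from the text."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = n * sxy - sx * sy
    denominator = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
    return numerator / denominator


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [3.0, 5.0, 7.0, 9.0, 11.0]    # exactly y = 2x + 1
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # exactly y = 12 - 2x
print(pearson_r(x, y_up), pearson_r(x, y_down))
```

The two perfectly linear examples give values at (or within floating-point error of) +1 and −1, the endpoints of the scale.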

Another is the Spearman method, which evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal. The following formula is used to calculate the Spearman rank correlation:

\[\rho = 1 - \frac{6\sum{d_i^2}}{n(n^2-1)}\]

\(\rho\) = Spearman rank correlation
\(d_i\) = the difference between the ranks of corresponding variables
\(n\) = number of observations
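A short sketch of the rank-difference formula follows. It assumes there are no tied values, since the shortcut formula above is only exact in that case; the helper names and data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


x = [10, 20, 30, 40, 50]
y = [1, 8, 27, 64, 125]   # monotone in x but not linear
rho = spearman_rho(x, y)
print(rho)
```

Because only ranks enter the calculation, the monotone but non-linear example still yields a rank correlation of 1.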

Using the Pearson method, the partial correlation is −0.4167175 with a p-value of 0.002351165, which suggests a moderate negative relationship between union density and vocational education when all other covariates are held constant. Using the Spearman method, the partial correlation is −0.27575 with a p-value of 0.008161668, which indicates a weaker but still negative relationship. A partial correlation is a unit-free measure of association between −1 and +1, not a slope, so it does not give the expected change in vocational education per unit of union density.

Both methods give a negative correlation between union density and vocational education. The small p-values (<0.05) indicate that the negative association is statistically significant in the primary model: the correlation is unlikely to be 0, which suggests that the variables are related. In the partial correlation plot of the residuals of union density against the residuals of Highschool, the red line, which represents the relationship between union density and vocational education, has a clear downward trend. However, there are leverage points in the lower right corner of the graph, which indicates that some data transformation might be needed. The next step is to validate the model by checking the assumptions.

  • Assumption Check

The plots above are used to check the assumptions of linear regression: linearity, normality, and constant variance. The first plot is the Residuals vs Fitted plot, which is used to check the linearity assumption. An ideal plot shows residuals spread equally around the horizontal line without any distinct pattern. The plot above indicates a violation of the linearity assumption. The normality assumption can be checked with the Q-Q plot, in which the residuals should approximately follow a straight line. Some points deviate at the top right of the plot, which indicates a violation of the normality assumption. The third plot, a scale-location plot, is used to check the assumption of equal variance. An ideal plot shows a horizontal line with randomly spread points. In our case, the constant variance assumption is violated. When the assumptions of linear regression are not met, a transformation of the variables may be required.

The most important assumption of the linear regression model is that each predictor is linearly related to the outcome after accounting for the other predictors. A component residual plot shows the relationship between each predictor and the outcome while accounting for the other predictors. The red line shows a smoothed fit of the data points, and the dashed line indicates the linear best fit. If the red line is close to the dashed line and does not show a curved pattern, the linearity assumption is satisfied for that variable. Based on the initial component residual plot of the original model, leverage points exist for several variables, including “GDP”, “population”, “Export”, “unemployment rate” and “cost of living”. Leverage points are a problem because they directly change the pattern of the residual plots. For example, in the residual plot of “Export”, if the leverage points move upwards or downwards, the red line and the dashed line move accordingly. This is not ideal, because the red line should reflect the pattern of the majority of the data points and should not be heavily influenced by just one or two leverage points. To reduce the influence of the leverage points, a log transformation is used to build the first model, “fit1”.

3.3.2 Model II

  • Original Model Transformation
    Because the original model does not align with the assumptions of linear regression, data transformation is used to adjust the model.
Observations 51
Dependent variable log(Highschool)
Type OLS linear regression
F(8,42) 17.09
R² 0.76
Adj. R² 0.72
Est. S.E. t val. p
(Intercept) 5.20 6.70 0.78 0.44
Union_density -0.04 0.03 -1.58 0.12
log(GDP) -0.44 0.55 -0.80 0.43
Urbanity 0.01 0.01 0.79 0.43
log(Population) 0.95 0.15 6.48 0.00
log(Export) -0.01 0.09 -0.06 0.95
Unemployment_rate 0.02 0.11 0.16 0.87
log(Cost_of_living) -0.63 0.98 -0.64 0.52
Gini -1.59 5.11 -0.31 0.76
Standard errors: OLS

In model “fit1”, a log transformation is applied to the outcome, “highschool”, and to the predictors “GDP”, “population”, “Export” and “cost of living”; as the table shows, unemployment rate enters untransformed. In the summary of “fit1”, the coefficient of “union density” is -0.04 and is not significant (p = 0.12). The insignificant coefficient indicates that the negative relationship between vocational education and union density is not conclusive.

After modeling, residual plots are generated to validate the model. In the residual plots in the Appendix, nearly all the residuals are spread equally around the horizontal line at zero, indicating that the residuals are random and the linearity assumption is satisfied. In the Q-Q plot, the majority of the residuals lie along the straight line, indicating that the normality assumption is nearly satisfied. In the scale-location plot, most residuals are spread equally across the range of fitted values, indicating that the equal variance assumption is satisfied. Based on the residual plots, the model “fit1” is more valid than the original model. In the component residual plots, the issue with leverage points has improved considerably. However, two issues remain. First, there is still a leverage point in GDP after the log transformation. Second, union density shows a slightly curved pattern, indicating that the linearity assumption might not be satisfied and that a more robust model might need to be considered. For this analysis, we did not move forward with a more robust model.

Observations 50
Dependent variable log(Highschool)
Type OLS linear regression
F(8,41) 15.39
R² 0.75
Adj. R² 0.70
Est. S.E. t val. p
(Intercept) -2.76 8.93 -0.31 0.76
Union_density -0.06 0.03 -2.01 0.05
log(GDP) 0.12 0.68 0.18 0.86
Urbanity 0.01 0.01 0.75 0.46
log(Population) 0.89 0.15 5.84 0.00
log(Export) -0.01 0.09 -0.10 0.92
Unemployment_rate 0.05 0.11 0.46 0.65
log(Cost_of_living) -0.40 0.99 -0.40 0.69
Gini 2.27 5.83 0.39 0.70
Standard errors: OLS

For a sensitivity analysis, we check whether the leverage point has a large impact on the regression results by refitting the model without it. The leverage point is Washington DC, which has the highest log GDP per capita. Another model (fit2), with exactly the same predictors and outcome, is fitted to a dataset that excludes Washington DC. In the summary of “fit2”, the coefficient of union density moves from -0.04 (with DC) to -0.06 (without DC) in the rounded table estimates, and its p-value drops from 0.12 to 0.05, so the difference is not negligible. Thus, the leverage point of Washington DC cannot simply be left out.

  • Partial Correlation

Using the Pearson method, the partial correlation is −0.2375583 with a p-value of 0.09323562, which indicates a weak negative association between union density and vocational education when all other covariates are held constant.

Using the Spearman method, the partial correlation is −0.2179186 with a p-value of 0.1244902, which is similar to the result of the Pearson method. After the transformation, the outputs of the two methods are close to each other, and neither is significant at the 0.05 level.

Partial correlation analysis is run again with the transformed variables to check the relationship between union density and vocational education. In the partial correlation plot, the red line becomes flatter. Based on the output of the two methods, the partial correlation coefficients between union density and vocational education are smaller in magnitude, and the p-values are larger. Thus, in the transformed model, the negative relationship between union density and vocational training is not as strong as it is in the original model.

3.3.3 Model III

  • Adding Racial Diversity to the Transformed Model
    To analyze whether racial diversity can explain the relationship, we add racial diversity as another confounding variable in the next model and check whether the coefficients change.
Observations 51
Dependent variable log(Highschool)
Type OLS linear regression
F(9,41) 17.69
R² 0.80
Adj. R² 0.75
Est. S.E. t val. p
(Intercept) 7.45 6.40 1.16 0.25
Union_density -0.02 0.03 -0.91 0.37
log(GDP) -0.05 0.54 -0.10 0.92
Urbanity -0.01 0.01 -0.62 0.54
log(Population) 0.80 0.15 5.27 0.00
log(Export) 0.10 0.10 1.01 0.32
Unemployment_rate -0.13 0.12 -1.07 0.29
log(Cost_of_living) -1.71 1.03 -1.66 0.10
Gini -1.01 4.83 -0.21 0.84
norm_Racial_diversity 1.59 0.65 2.46 0.02
Standard errors: OLS

The residual plots of the model with normalized racial diversity are similar to those of the transformed model. The main change is that the adjusted R-squared increases slightly, from 0.7202 to 0.7503, meaning that this model captures more information from the dataset. R-squared is the proportion of variance in the outcome explained by the model, so a high R-squared means the model captures most of the variation in the outcome. The adjusted R-squared is a modified version of R-squared that increases only if the new term improves the model more than would be expected by chance. Therefore, after adding normalized racial diversity, the new model fits better than the transformed model, while the coefficient for union density changes only slightly (from -0.04 to -0.02) and remains insignificant.
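The adjustment can be made concrete with the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The sketch below applies it to the rounded R² values from the two tables; because the inputs are rounded to two decimals, the results only approximate the reported 0.7202 and 0.7503.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Rounded R-squared values taken from the summary tables (51 observations).
transformed = adjusted_r2(0.76, n=51, p=8)     # transformed model
with_diversity = adjusted_r2(0.80, n=51, p=9)  # plus normalized racial diversity
print(round(transformed, 3), round(with_diversity, 3))
```

Holding R² fixed, adding a predictor lowers the adjusted value, so the adjusted R² only rises when the new term raises R² by enough to offset the penalty.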

  • Partial Correlation
    In the partial correlation plot of the third model, the red line is close to horizontal, indicating that the negative correlation between union density and vocational training is weak. Furthermore, there is a leverage point on the left side of the plot, and this point may be responsible for the slight slope. The red line might become horizontal if the leverage point were removed.

Using the Pearson method, the partial correlation is −0.1242534 with a p-value of 0.3849993, a weak negative association when all other covariates are held constant.

Using the Spearman method, the partial correlation is −0.1408109 with a p-value of 0.09939286, again a weak negative association. With both methods, the partial correlations between union density and vocational education are closer to zero than in the transformed model, and the p-values (>0.05) indicate that the relationship between these two factors is not statistically significant. Therefore, we cannot conclude that the correlations differ from zero, which suggests there may not be a relationship between union density and vocational training.

Conclusion

To summarize, the analysis produced two findings. First, after transformation there is no significant relationship between union density and vocational education. Second, adding racial diversity changes the coefficient of union density in the linear model to some extent, from -0.04 to -0.02. On the question of whether racial diversity explains the relationship, however, no conclusion can be drawn. Therefore, in the second pass, mediation analysis is used to further explore whether racial diversity mediates the relationship between union density and vocational education.

4.Part II

4.1 Data Interpretation

  • Data description
    A second dataset was received from the client for the mediation analysis. After the client gave more thought to the confounders that need to be taken into account, the data consist of the following:
Variable Name                   Interpretation                          Time Span
Urbanity                        Quality of being urban in each state    2010
Unemployment                    Unemployment rate in each state         2009-2017
GDP_Per_Capita                  GDP / population                        2005-2017
Union                           Union density for each state            2005-2017
Voe                             Vocational training concentrators       2007-2017
Racial (black/hispanic/white)   Proportion of each race                 2005-2017
Totalpop                        Total population                        2009-2017
Studentpop                      Student population                      2009-2017
  • Data governance
    Two main issues were found before starting the analysis. First, the dataset has missing data: values for some variables between 2005 and 2009 are missing. We drop this part of the data, so the models are built on 2009 to 2017. Second, the dataset only contains urbanity for 2010; the same value is applied to the other years, which could influence the analysis.

4.2 Methodology and Result

The first part of this section presents a series of models based on the 2010 data, and the later part uses data from 2009 to 2017. Both parts concentrate on measuring the relationship between union density and vocational training in the new dataset. In addition, mediation analysis is applied to assess how strongly racial diversity, acting as a mediator, influences that relationship.

4.2.1 Model I:Linear Model on Year 2010

Observations 37
Dependent variable Voe
Type OLS linear regression
F(5,31) 8.34
R² 0.57
Adj. R² 0.50
Est. S.E. t val. p
(Intercept) 208754.90 128169.24 1.63 0.11
Union -6044.25 2939.69 -2.06 0.05
Urbanity -1823.94 1425.88 -1.28 0.21
Unemployment -13921.43 9041.65 -1.54 0.13
GDP_Per_Capita 0.94 2.05 0.46 0.65
Totalpop 0.02 0.00 5.65 0.00
Standard errors: OLS

The first linear model regresses vocational education concentrators on union density, urbanity, unemployment rate, GDP per capita and total population for the year 2010. Because urbanity is only available for 2010, we first fit a linear regression on the year with correct data before fitting models that repeat the urbanity values. In the summary of the first model, the intercept is 208754.90, which means a state's vocational education level would be 208754.90 if its union density, urbanity, unemployment, GDP per capita and total population were all 0. The coefficient of union is -6044.25, which means that a one-unit increase in union density is associated with a decrease of 6044.25 in the average number of vocational education concentrators, all other variables held constant. Based on the diagnostic plots in the appendix, the assumptions of linear regression are violated, so a log transformation of the outcome and some predictors is used next.

4.2.2 Model II:Model I Transformation

Observations 37
Dependent variable log(Voe)
Type OLS linear regression
F(5,31) 10.04
R² 0.62
Adj. R² 0.56
Est. S.E. t val. p
(Intercept) 2.65 13.76 0.19 0.85
Union -0.02 0.03 -0.64 0.53
Urbanity 0.01 0.02 0.55 0.58
Unemployment -0.11 0.11 -0.99 0.33
log(GDP_Per_Capita) -0.68 1.23 -0.55 0.59
log(Totalpop) 1.01 0.25 4.00 0.00
Standard errors: OLS

Model two is built based on the transformed model.

This model should be interpreted on the log scale. The intercept is 2.65, which means a state's vocational education level would be \(\exp(2.65)\) when its union density, urbanity and unemployment are 0 and its log GDP per capita and log total population are 0 (that is, both equal 1). The coefficient of union is -0.02, which means a one-unit increase in union density is associated with approximately a 2% decrease in average vocational education, all other variables held constant. The coefficient of log total population is 1.01, which means a 1% increase in total population is associated with approximately a 1.01% increase in average vocational education. The diagnostic plots in the appendix show that the assumptions are met. Based on the p-value (>0.05), we do not have enough evidence to reject the hypothesis that there is no relationship between vocational education concentrators and union density.
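The percentage readings above follow from standard rules for log-transformed models: a coefficient b on an untransformed predictor corresponds to a change of exp(b) − 1 in the outcome per unit, and a coefficient b on a logged predictor acts as an elasticity. A small sketch, with hypothetical helper names and the rounded coefficients from the table:

```python
from math import exp


def pct_change_log_linear(beta, delta=1.0):
    """% change in y for a delta-unit increase in x when log(y) = ... + beta * x."""
    return (exp(beta * delta) - 1) * 100


def pct_change_log_log(beta, pct_increase=1.0):
    """% change in y for a pct_increase % rise in x when log(y) = ... + beta * log(x)."""
    return ((1 + pct_increase / 100) ** beta - 1) * 100


union_effect = pct_change_log_linear(-0.02)  # Union coefficient
pop_effect = pct_change_log_log(1.01)        # log(Totalpop) coefficient
print(round(union_effect, 2), round(pop_effect, 2))
```

For small coefficients, exp(b) − 1 is close to b itself, which is why -0.02 is read as roughly a 2% decrease.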

4.2.3 Model III:Model II Add Racial Diversity

Observations 37
Dependent variable log(Voe)
Type OLS linear regression
F(6,30) 8.39
R² 0.63
Adj. R² 0.55
Est. S.E. t val. p
(Intercept) -3.53 15.73 -0.22 0.82
Union -0.03 0.04 -0.93 0.36
Urbanity 0.01 0.02 0.58 0.57
Unemployment -0.07 0.12 -0.60 0.55
log(GDP_Per_Capita) -0.17 1.38 -0.13 0.90
log(Totalpop) 1.08 0.27 4.04 0.00
norm_Racial_diversity -0.81 0.98 -0.83 0.42
Standard errors: OLS

Model three is the transformed model with normalized racial diversity added. In the summary of this model, the p-value of union density is still large (0.36) and its coefficient (-0.03) is close to 0. The coefficients are interpreted in the same way as in the models above. The coefficient of racial diversity is itself not significant (p = 0.42), so racial diversity is unable to explain the negative relationship between union density and vocational training.

4.2.4 Mediation Analysis on Model III

To investigate whether racial diversity actually mediates the relationship between union density and vocational education, mediation analysis is conducted. A mediation model proposes that the independent variable influences the mediator variable, which in turn influences the dependent variable. In this project, union density is the independent variable and serves as the treatment, vocational education is the dependent variable and acts as the outcome, and racial diversity is the mediator. The mediation analysis explores whether union density first affects racial diversity, and racial diversity then affects vocational education. Because mediation analysis is a causal inference method, we assume that there are no unmeasured confounding variables.

The direct effect is the effect of the treatment (independent variable) on the outcome (dependent variable) that does not pass through the mediator. The mediation effect is the effect of the treatment on the outcome that goes through the mediator. In this project, the direct effect is the effect of union density on vocational training that does not operate through racial diversity, and the mediation effect is the part that operates through racial diversity. In the mediation analysis, we assume that the sum of the direct effect and the mediation effect is the total effect. In the results, ACME stands for average causal mediation effect, and ADE stands for average direct effect.

## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                Estimate 95% CI Lower 95% CI Upper p-value
## ACME             0.0131      -0.0181         0.05    0.43
## ADE             -0.0345      -0.1064         0.04    0.34
## Total Effect    -0.0214      -0.0866         0.05    0.53
## Prop. Mediated  -0.1369      -8.6769         8.44    0.74
## 
## Sample Size Used: 37 
## 
## 
## Simulations: 1000

Based on the summarized results of the mediation analysis on the 2010 data, the direct effect (ADE) is -0.0345, which is not statistically significant (p = 0.34), indicating that union density and vocational training do not have a strong relationship. Furthermore, the estimate of the mediation effect (ACME) is 0.0131, which is very close to 0 and not significant (p = 0.43), indicating that there is nearly no mediation effect. The estimated proportion mediated is -0.1369, with a confidence interval from -8.68 to 8.44 that is far too wide to be informative. In the plot, a large portion of the total effect comes from the direct effect, and the mediation effect is around zero.
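The decomposition of a total effect into a direct and a mediated part can be illustrated with the classical product-of-coefficients approach for linear models. This is a simplified stand-in, not the quasi-Bayesian simulation that produced the output above: it uses no covariates, the data are made up, and all names are hypothetical.

```python
def _cross(u, v):
    """Sum of centered cross-products of u and v."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def mediation_decomposition(treatment, mediator, outcome):
    """Least-squares decomposition: total = direct + indirect (a * b)."""
    stt = _cross(treatment, treatment)
    smm = _cross(mediator, mediator)
    stm = _cross(treatment, mediator)
    sty = _cross(treatment, outcome)
    smy = _cross(mediator, outcome)
    a = stm / stt                            # slope of mediator ~ treatment
    det = stt * smm - stm ** 2
    direct = (sty * smm - smy * stm) / det   # treatment slope in outcome ~ treatment + mediator
    b = (smy * stt - sty * stm) / det        # mediator slope in the same regression
    total = sty / stt                        # slope of outcome ~ treatment
    return {"indirect": a * b, "direct": direct, "total": total}


# Made-up values for illustration only.
union = [5.0, 8.0, 12.0, 15.0, 20.0, 23.0]
diversity = [0.30, 0.42, 0.35, 0.55, 0.48, 0.60]
log_voe = [9.1, 8.7, 8.9, 7.6, 7.9, 6.8]
effects = mediation_decomposition(union, diversity, log_voe)
print({k: round(v, 4) for k, v in effects.items()})
```

In this linear setting the indirect and direct effects add up exactly to the total effect, which mirrors the assumption stated in the text.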

4.2.5 Model IV:All Year Model Transformation

Observations 298
Dependent variable log(Voe)
Type OLS linear regression
F(13,284) 33.88
R² 0.61
Adj. R² 0.59
Est. S.E. t val. p
(Intercept) -0.16 4.21 -0.04 0.97
Union -0.04 0.01 -3.28 0.00
Urbanity 0.01 0.01 0.97 0.33
Unemployment -0.17 0.04 -4.35 0.00
log(GDP_Per_Capita) -0.40 0.36 -1.10 0.27
log(Totalpop) 1.05 0.08 13.05 0.00
factor(Year)2010 0.02 0.18 0.12 0.91
factor(Year)2011 -0.29 0.19 -1.53 0.13
factor(Year)2012 -0.24 0.20 -1.21 0.23
factor(Year)2013 -0.12 0.20 -0.63 0.53
factor(Year)2014 -0.49 0.21 -2.31 0.02
factor(Year)2015 -0.45 0.22 -2.05 0.04
factor(Year)2016 -0.59 0.25 -2.37 0.02
factor(Year)2017 -0.65 0.26 -2.49 0.01
Standard errors: OLS

Model four uses data from 2009 to 2017. The coefficient of union density is -0.04, which means that for every one-unit increase in union density, Voe is expected to decrease by about 4% (\(\exp(-0.04) - 1 \approx -0.039\)). The year factor coefficients shift the intercept relative to the baseline year 2009; they do not change the coefficients of the other variables, because the model contains no interaction between year and the predictors. For example, the coefficient of union density is -0.04 in every year, while the coefficient of -0.65 for 2017 means that, holding the other variables constant, log(Voe) in 2017 is 0.65 lower than in 2009.

In the Q-Q plot in the Appendix, there is one outlier in the lower left, which is the observation for Louisiana in 2010. However, it is unreasonable to remove a single year of data for one state, so we keep it in the following models.
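A small numeric sketch of how year indicators act in such a model: they move the level of log(Voe) but leave the union slope untouched. It uses a few rounded coefficients from the table, treats all other covariates as fixed, and the function name is hypothetical.

```python
from math import exp

UNION_SLOPE = -0.04
# Year effects relative to the 2009 baseline (rounded values from the table).
YEAR_EFFECT = {2009: 0.0, 2010: 0.02, 2014: -0.49, 2017: -0.65}


def log_voe_shift(union, year):
    """Part of predicted log(Voe) that depends on union density and year only."""
    return UNION_SLOPE * union + YEAR_EFFECT[year]


# The effect of one extra unit of union density is the same in every year ...
slope_2009 = log_voe_shift(11, 2009) - log_voe_shift(10, 2009)
slope_2017 = log_voe_shift(11, 2017) - log_voe_shift(10, 2017)

# ... while the year indicator changes the level at any fixed union density.
pct_2017_vs_2009 = (exp(log_voe_shift(10, 2017) - log_voe_shift(10, 2009)) - 1) * 100
print(round(slope_2009, 2), round(slope_2017, 2), round(pct_2017_vs_2009, 1))
```

A year-specific union slope would require an interaction term between union density and year, which this model does not include.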

4.2.6 Model V:Model IV Add Racial Diversity

Observations 298
Dependent variable log(Voe)
Type OLS linear regression
F(14,283) 31.38
R² 0.61
Adj. R² 0.59
Est. S.E. t val. p
(Intercept) -0.62 4.34 -0.14 0.89
Union -0.04 0.01 -3.23 0.00
Urbanity 0.01 0.01 1.05 0.29
Unemployment -0.17 0.04 -3.89 0.00
log(GDP_Per_Capita) -0.36 0.37 -0.98 0.33
log(Totalpop) 1.06 0.08 12.97 0.00
norm_Racial_diversity -0.14 0.31 -0.45 0.65
factor(Year)2010 0.02 0.18 0.09 0.93
factor(Year)2011 -0.29 0.19 -1.52 0.13
factor(Year)2012 -0.24 0.20 -1.18 0.24
factor(Year)2013 -0.12 0.20 -0.63 0.53
factor(Year)2014 -0.48 0.21 -2.24 0.03
factor(Year)2015 -0.43 0.22 -1.93 0.05
factor(Year)2016 -0.56 0.25 -2.23 0.03
factor(Year)2017 -0.63 0.27 -2.32 0.02
Standard errors: OLS

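After adding the normalized racial diversity to model four, the coefficient of union density stays at -0.04 and remains significant, while the coefficient of racial diversity (-0.14) is not significant (p = 0.65). Therefore, it is hard to say that racial diversity can explain the negative relationship between union density and Voe.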

4.2.7 Mediation Analysis on Model V

## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                Estimate 95% CI Lower 95% CI Upper p-value    
## ACME            0.00156     -0.00635         0.01    0.71    
## ADE            -0.03760     -0.06093        -0.01  <2e-16 ***
## Total Effect   -0.03603     -0.05853        -0.01  <2e-16 ***
## Prop. Mediated -0.03514     -0.36657         0.21    0.71    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 298 
## 
## 
## Simulations: 1000

Using the data from 2009 to 2017, the direct effect is -0.0376 and this time it is significant, which might indicate that union density and vocational education have a negative relationship. However, this finding rests on two assumptions: that urbanity in every year equals its 2010 value, and that all confounding variables are included in the model. The mediation effect is 0.0016, which is close to zero and not significant (p = 0.71). Based on the plot, most of the effect comes from the direct effect while the mediation effect ranges around zero, indicating that racial diversity does not mediate the relationship between union density and vocational education.

4.2.8 Model VI:Model IV Using Student Population and Racial

Observations 298
Dependent variable log(Voe)
Type OLS linear regression
F(14,283) 31.87
R² 0.61
Adj. R² 0.59
Est. S.E. t val. p
(Intercept) 1.79 4.22 0.42 0.67
Union -0.04 0.01 -3.53 0.00
Urbanity 0.01 0.01 0.99 0.32
Unemployment -0.16 0.04 -3.75 0.00
log(GDP_Per_Capita) -0.30 0.37 -0.80 0.42
log(StudentpopE) 1.06 0.08 13.13 0.00
norm_Racial_diversity -0.26 0.31 -0.83 0.40
factor(Year)2010 0.00 0.18 0.01 0.99
factor(Year)2011 -0.27 0.19 -1.47 0.14
factor(Year)2012 -0.21 0.20 -1.03 0.30
factor(Year)2013 -0.08 0.20 -0.41 0.68
factor(Year)2014 -0.41 0.21 -1.93 0.05
factor(Year)2015 -0.35 0.22 -1.59 0.11
factor(Year)2016 -0.47 0.25 -1.87 0.06
factor(Year)2017 -0.52 0.27 -1.96 0.05
Standard errors: OLS

The dataset contains both total population and student population. Because the client wants to know how the student population would influence the results, we replaced the total population with the student population. The interpretation of the results changes accordingly: the research question now targets the proportion of students taking vocational education out of the total student population rather than the state population. The coefficient of union density is -0.04, which means that for every 1-unit increase in union density, VoE is expected to decrease by about 4% \[\exp(-0.04) \approx 0.96\].
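As a quick check on the percentage interpretation above, the following minimal sketch converts a coefficient from a log-outcome regression into a percent change in the outcome. The helper name `pct_change_from_log_coef` is ours, introduced only for illustration, and the input is the rounded union density estimate from the table, so the result is approximate.

```python
import math

def pct_change_from_log_coef(beta, delta=1.0):
    """Percent change in the outcome for a `delta`-unit increase in a predictor
    when the outcome enters the model on the log scale:
    100 * (exp(beta * delta) - 1)."""
    return 100.0 * (math.exp(beta * delta) - 1.0)

# Rounded union density coefficient reported for Model VI.
union_effect = pct_change_from_log_coef(-0.04)
print(f"{union_effect:.2f}%")  # prints "-3.92%"
```

For small coefficients the exact figure is close to 100 times the coefficient itself, which is why -0.04 is read as roughly a 4% decrease.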

4.2.9 Mediation Analysis on Model VI

## 
## Causal Mediation Analysis 
## 
## Quasi-Bayesian Confidence Intervals
## 
##                Estimate 95% CI Lower 95% CI Upper p-value    
## ACME            0.00339     -0.00462         0.01     0.4    
## ADE            -0.04118     -0.06516        -0.02  <2e-16 ***
## Total Effect   -0.03779     -0.05980        -0.02  <2e-16 ***
## Prop. Mediated -0.08761     -0.40654         0.15     0.4    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 298 
## 
## 
## Simulations: 1000

For this mediation analysis, the student population replaces the total population. Again, most of the effect comes from the direct effect (ADE = -0.0412) rather than the mediation effect (ACME = 0.0034, p = 0.4), indicating that racial diversity does not mediate the relationship between union density and vocational education.
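The two mediation outputs can be sanity-checked with simple arithmetic: the total effect should be approximately the sum of the mediation effect (ACME) and the direct effect (ADE). The sketch below uses only the point estimates printed in Sections 4.2.7 and 4.2.9. The function names are hypothetical, and the naive ratio ACME / total need not equal the reported "Prop. Mediated", which is presumably summarised over the simulation draws rather than computed from the point estimates.

```python
# Point estimates copied from the mediation outputs in Sections 4.2.7 and 4.2.9.
reported = {
    "Model V":  {"acme": 0.00156, "ade": -0.03760, "total": -0.03603},
    "Model VI": {"acme": 0.00339, "ade": -0.04118, "total": -0.03779},
}

def decomposition_gap(est):
    """Reported total effect minus (ACME + ADE); should be near zero."""
    return est["total"] - (est["acme"] + est["ade"])

def share_of_total(est):
    """Naive ratio ACME / total effect (illustrative, not the package's summary)."""
    return est["acme"] / est["total"]

for name, est in reported.items():
    gap = decomposition_gap(est)
    share = share_of_total(est)
    print(f"{name}: gap = {gap:.5f}, ACME / total = {share:.3f}")
```

In both models the gap is within rounding error, and the mediation share is small and of opposite sign to the total effect, consistent with the conclusion that the direct effect carries almost all of the relationship.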

4.2.10 Model VII: Model IV Excluding DC

Observations 297
Dependent variable log(Voe)
Type OLS linear regression
F(13,283) 32.82
R² 0.60
Adj. R² 0.58
Est. S.E. t val. p
(Intercept) -1.27 4.53 -0.28 0.78
Union -0.04 0.01 -3.34 0.00
Urbanity 0.01 0.01 0.95 0.34
Unemployment -0.16 0.04 -3.84 0.00
log(GDP_Per_Capita) -0.29 0.40 -0.72 0.47
log(Totalpop) 1.05 0.08 12.81 0.00
factor(Year)2010 0.01 0.18 0.06 0.96
factor(Year)2011 -0.29 0.19 -1.56 0.12
factor(Year)2012 -0.24 0.20 -1.21 0.23
factor(Year)2013 -0.13 0.20 -0.67 0.50
factor(Year)2014 -0.49 0.21 -2.30 0.02
factor(Year)2015 -0.44 0.22 -2.00 0.05
factor(Year)2016 -0.56 0.25 -2.27 0.02
factor(Year)2017 -0.63 0.27 -2.38 0.02
Standard errors: OLS

The component-plus-residual plot of model four shows some leverage points in GDP per capita. These leverage points belong to DC. Model seven is therefore fit on data that exclude DC. The results indicate that removing the leverage points changes little compared with model four.
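To illustrate why one extreme predictor value can be flagged as a leverage point, here is a minimal sketch of hat values in a one-predictor regression, using h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2. The GDP-per-capita values are made up for illustration and are not taken from the dataset. The function name and the 2(p+1)/n cutoff (a common rule of thumb) are assumptions and not the criterion used in the report, which relied on the plot.

```python
def hat_values(x):
    """Leverage of each observation in a simple regression with an intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical GDP per capita (thousands of dollars); the last value mimics
# a unit that sits far from the rest, as DC does in the report's plot.
gdp_per_capita = [42.0, 45.0, 47.0, 50.0, 52.0, 55.0, 58.0, 160.0]

h = hat_values(gdp_per_capita)
cutoff = 2 * 2 / len(gdp_per_capita)  # 2 * (intercept + slope) / n
flagged = [i for i, hi in enumerate(h) if hi > cutoff]
print(flagged)  # prints [7]
```

Refitting without the flagged observation and comparing coefficients, as model seven does, is a direct way to see whether such a point drives the estimates.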

4.2.11 Model VIII: Model IV Dropping Urbanity

Observations 298
Dependent variable log(Voe)
Type OLS linear regression
F(13,284) 33.70
R² 0.61
Adj. R² 0.59
Est. S.E. t val. p
(Intercept) -3.23 3.56 -0.91 0.37
Union -0.04 0.01 -3.15 0.00
Unemployment -0.17 0.04 -3.93 0.00
log(GDP_Per_Capita) -0.15 0.31 -0.49 0.63
log(Totalpop) 1.11 0.06 17.67 0.00
norm_Racial_diversity -0.06 0.30 -0.19 0.85
factor(Year)2010 -0.01 0.18 -0.05 0.96
factor(Year)2011 -0.32 0.18 -1.73 0.08
factor(Year)2012 -0.28 0.20 -1.40 0.16
factor(Year)2013 -0.16 0.19 -0.81 0.42
factor(Year)2014 -0.53 0.21 -2.50 0.01
factor(Year)2015 -0.48 0.22 -2.17 0.03
factor(Year)2016 -0.61 0.25 -2.44 0.02
factor(Year)2017 -0.68 0.26 -2.58 0.01
Standard errors: OLS

Our client later informed us that urbanity may not be a confounding variable for vocational education. Therefore, model eight is the regression without urbanity. The results indicate that removing urbanity changes little compared with model four.

4.3 Conclusion

In conclusion, the coefficients of union density in these models are similar and tend to be significant. According to those coefficients, union density has a negative relationship with vocational education. However, these models rest on two assumptions. First, we assume there are no unmeasured confounding variables. Second, we assume urbanity is the same in every year, using the 2010 data. Finally, based on the results of the mediation analysis, racial diversity has no mediation effect on the negative relationship between union density and vocational education.

Appendix

Part1: Results of original model

Observations 51
Dependent variable Highschool
Type OLS linear regression
F(8,42) 10.66
R² 0.67
Adj. R² 0.61
Est. S.E. t val. p
(Intercept) 475390.39 501428.38 0.95 0.35
Union_density -16915.99 5693.93 -2.97 0.00
GDP 0.07 1.36 0.05 0.96
Urbanity -1044.78 1826.13 -0.57 0.57
Population 0.03 0.00 7.31 0.00
Export -0.52 0.55 -0.95 0.35
Unemployment_rate 19091.11 24534.93 0.78 0.44
Cost_of_living 1240.92 1669.18 0.74 0.46
Gini -959776.82 1142093.37 -0.84 0.41
Standard errors: OLS

Part1: Assumption check of transformed model

Part1: Assumption check and crplot when DC is removed

Part1: Assumption check and crplot when racial diversity is added

Part2: Assumption check and crplot when year 2010

Part2: Assumption check and crplot when transformed year 2010

Part2: Assumption check and crplot when transformed year 2010 and add racial diversity

Part2: Assumption check and crplot when transformed all year

Part2: Assumption check and crplot when transformed all year and add racial diversity

Part2: student pop

Part2: remove DC

Part2: remove urbanity